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Abstract 

We propose a new algorithm to solve optimization problems of the form min f{X) for 
a smooth function / under the constraints that X is positive semidefinite and the diagonal 
blocks of X are small identity matrices. Such problems often arise as the result of relaxing a 
rank constraint (lifting). In particular, many estimation tasks involving phases, rotations, 
orthonormal bases or permutations fit this framework, and so do certain relaxations of 
combinatorial problems such as Max-Cut. The proposed algorithm exploits the facts 
that (1) such formulations admit low-rank solutions, and (2) their rank-restricted versions 
are smooth optimization problems on a Riemannian manifold. Combining insights from 
both the Riemannian and the convex geometries of the problem, we characterize when 
second-order critical points of the smooth problem reveal KKT points of the semidefinite 
problem. In particular, we bound the ranks that need to be considered, deterministically. 
Comparison against state of the art, mature software shows that, on certain interesting 
problem instances, what we call the staircase method is orders of magnitude faster, is 
more accurate and scales better. Code is available. 
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1 Introduction 


This paper considers the generic problem of estimating matrices Y \,... ,Y m £ M dxp , d < p, 
with orthonormal rows, that is, such that Yj-Y, T = I d (identity of size d) for all i. We further 
focus on problems where only relative information is available, that is, information about YfYj 
for some of the pairs ( i , j ) is available, but there is no information about individual Yj’s. As will 
be detailed below, particular cases of this come up in a number of applications. For example, 
when p = d = 1, the variables F) reduce to {±1} and the products YfYj indicate whether Y t 
and Yj have the same sign or not, allowing to model certain combinatorial problems. When 
d = 1 ,p> 1, the variables reduce to unit-norm vectors, and the products correspond to inner 
products between them, allowing to model correlations and proximity on a sphere. Finally, 
when d = p > 1, the matrices are orthogonal, and the products Y t Yj = YjYf ~ 1 represent 
relative orthogonal transformations, such as rotations, reflections and permutations. 

For ease of notation, we stack the orthonormal matrices on top of each other to form 
Y £ R nxp , n = md > p. Then, X = YY T is a block matrix whose block X t] £ R dxd 
corresponds to the relative product YfYj. Define the (transposed) Stiefel manifold as 

St (d,p) = {Z £ R dxp : ZZ T =I d }, 
and the set of y’s obtained by stacking as 

st(d, p) m = { Y £ M mdx P : Y T = (yT Y 2 t ■■■ yj) and Y 1} ..., Y m £ St(d, p)} 

= |y £ R nxp : (yy T )„; = I d for i = 1... m} . (1) 

This paper is concerned with solving optimization problems of the form 

min g(Y) = f(YY T ). subject to Y £ St (d,p) m , (RPp) 

Ye K nx p 

with twice continuously differentiable cost /: § nxn R defined over the symmetric matrices. 
Here, g is for example the negative likelihood of Y with respect to available data. The 
restriction that g(Y) be only a function of YY T encodes the property that only relative 
information is available, through ( YY T )ij = YfY ) T . This induces invariance of the cost under 
right-action of the orthogonal group. Indeed, g(Y Q) = g{Y) f° r an y orthogonal matrix Q of 
size p. Thus, solutions of (RP p ) are only defined up to this group action. 

Problem (RP p ) is computationally hard. In particular, for d = p = 1 and linear /, it covers 
the NP-hard Max-Cut problem [36]. Following that and other previous work [13, 66, 9], we 
consider a relaxation through the following observation. For all Y £ St(d, p) m , the matrix 
X = yy T is positive semidefinite, its diagonal blocks Xa are identity matrices I d , and it has 
rank at most p. Conversely, any matrix X with those properties can be factored as YY T with 
y £ St(d, p) m . In other words, problem (RP p ) is equivalent to optimizing / over the convex 
set 


C = {X £ S nxn : X h o and X l% = I d for * = 1... m) , (2) 

with the additional constraint rank(X) < p. As often, the rank constraint is the culprit. 
Indeed, continuing with the Max-Cut example (linear /), optimization over C without the rank 
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constraint is a semidefinite program , which can be solved to arbitrary precision in polynomial 
time [76]. 

This is motivation to study the relaxation obtained by ignoring the rank constraint (for 
linear /, it is also the dual of the dual of (RPp)): 

min f(X), subject to IgC. (P) 

jveR nx " 

The optimal cost of (P) is a lowerbound on that of (RP p ). Furthermore, if (P) admits a 
solution X = YY t with Y E M nxp , that is, a solution of rank at most p, then Y is a solution 
of (RP p ). When that is not the case, a higher-rank solution X may still be projected to a 
(hopefully good) initial guess for the nonconvex problem (RP p ). See [49, 9] for a discussion of 
approximation results related to these projections. The price to pay is that C is much higher 
dimensional than St (d,p) m : this is called a lift [13]. 

For linear /, solutions of (P) can of course be computed using standard SDP solvers, such 
as interior point methods (IPM). Unfortunately, as demonstrated in Section 5, IPM’s do not 
scale well. The main reason for it is that, as the name suggests, IPM’s iterate inside the 
interior of the search space C. The latter is formed by full-rank, dense matrices of size n: this 
quickly becomes unmanageable. 

The full-rank operations seem even more wasteful considering that, still for linear /, 
problem (P) always admits a solution of rank at most 

* a/1 + 4:m,d(d + 1) - 1 

p =— --- < (d + l)y/m <C n. (3) 

Indeed, this follows a general result of Shapiro [62], Barvinok [12] and Pataki [53] regarding 
extreme points of spectrahedra, 1 that is, intersections of the positive semidefinite cone with 
an affine subspace—the geometry of C is discussed in Section 2.2. This prompted Burer 
and Monteiro [19, 20] to propose SDPLR, a generic SDP solver which exploits the low-rank 
phenomenon. Applying SDPLR to our problem amounts to computing a local minimizer Y 
of (RPp) for some small p, using classical nonlinear optimization algorithms and penalizing 
for the constraints in a Lagrangian way. Then, p is increased as needed until YY T can be 
certified as a solution to the SDP. 

SDPLR is powerful and generic, and the theory accompanying the algorithm brings great 
insight into the problem. But it also has some downsides we want to improve on in the context 
of (P). First, it is not an easy matter to guarantee convergence to (even local) optimizers 
in the nonlinear subproblems. Furthermore, since constraints are enforced by penalization, 
they are not accurately satisfied by the returned solution. Finally, we would like to allow for 
nonlinear /. Nevertheless, Section 5 shows SDPLR improves significantly upon IPM’s. 

Journee et al. [40] build upon SDPLR, observing that certain SDP’s harbor an elegant 
Riemannian geometry that can be put to good algorithmic use. In particular, they cover 
what here corresponds to the case d = 1 and observe that, as remains true for d > 1, (RPp) 
is an optimization problem on a smooth space: St (d,p) m is a Riemannian manifold —this 
geometry is detailed in Section 2.1. Allowing for smooth nonlinear /, they apply essentially the 
SDPLR machinery, replacing the nonlinear programming algorithms for (RPp) by Riemannian 

1 The name spectrahedron for the search space of a semidefinite program echoes the name polyhedron for 
the search space of a linear program. 
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optimization algorithms [4]. These algorithms exploit the smooth structure of the nonconvex 
search space, resulting in constraint satisfaction up to numerical accuracy, as well as notable 
speedups. 

As a further refinement, Journee et al. [40] address the invariance of / under orthogonal 
group action. Instead of optimizing f(YY T ) over St (d,p) m , they optimize over the quotient 
space St(d,p)™/~, where ~ is an equivalence relation defined over St(d,p)™ (the full-rank 
elements of St {d,p) m ) by Y ~ Y YY T = YY T . The advantage is that this quotient space, 
which is still a smooth Riemannian manifold, is now one-to-one with the rank-p matrices 
in C. Unfortunately, the geometry breaks down at rank-deficient K’s (to see this, notice 
that equivalence classes of different rank have different dimension; see also Figure 1). The 
breakdown is problematic since, as will become clear, it is desirable to converge to rank- 
deficient K’s. Furthermore, that paper too asks for computation of local optimizers of the 
subproblems, which, on Riemannian manifolds too, is a difficult task. 

In both [20] and [40], one of the keys to practical efficiency is (well-justified) optimism: 
(RPp) is first solved for small values of p, and p is increased only as needed. In both papers, 
it is observed that, in practice, it often suffices to reach p just above the rank of the target 
solution of (P), which may be quite small; but there is no theory to confirm this. We do not 
prove such a strong result either, but we give some nontrivial, deterministic bounds on “how 
high one must lift”, refining certain results of [20] . 

1.1 Contribution 

In this paper, we describe the Riemannian geometry of St (d,p) m in order to frame (RPp) as a 
Riemannian optimization problem. We use existing algorithms [4] and the Manopt toolbox [17] 
to compute critical points of (RP p ), that is, points where the (Riemannian) gradient of the 
cost function vanishes. In practice, those algorithms tend to converge to second-order critical 
points, that is, points where the (Riemannian) Hessian is also positive semidehnite, because 
all other critical points are unstable fixed points of the iteration. 

For p > d, St (d,p) m is a connected, 2 compact and smooth space. Since we further assume 
sufficient smoothness in / too, this makes for a nice problem with no delicate limit cases to 
handle. Furthermore, Riemannian optimization algorithms iterate on the manifold directly: 
all iterates satisfy constraints up to numerical accuracy. 

We then turn our attention to computing Karush-Kuhn-Tucker (KKT) points for (P). 
These are points that satisfy first-order necessary optimality conditions. If / is convex, the 
conditions are also sufficient. Our goal is to compute KKT points via the computation of 
second-order critical points of (RP p ), which is lower-dimensional. A key property that makes 
this possible is the availability of an explicit dual matrix S(X) (21) which intervenes in both 
sets of conditions. 

Using this dual matrix, we show that rank-deficient second-order critical points Y reveal 
KKT points X = YY T . Furthermore, when a computed second-order critical point is full 
rank, it is shown how to use it as a warm-start for the computation of a second-order critical 
point of (RP p ) with a larger value of p. It is guaranteed that if p is allowed to grow up to 
n, then all second-order critical points reveal KKT points, so that the procedure terminates. 

2 For p = d, St (d,p) m has 2 m disconnected components, because the orthogonal group has two components: 
matrices with determinant +1 and —1. This is a strong incentive to relax at least to p = d + 1. 
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This is formalized in Algorithm 1, which we call the Riemannian Staircase, as it lifts (RP p ) 
to (P) step by step, instead of all at once. 

The above points rest extensively on work discussed earlier in this introduction [19, 20, 40], 
and improve upon those along the lines announced in the same. In particular, we do not require 
the computation of local optimizers of (RP p ), and we avoid the geometry breakdown tied to 
the quotient approach in [40]. We also stress that the latter reference only covers d = 1, and 
SDPLR only covers linear /. 

We further take particular interest in understanding how large p may grow in the staircase 
algorithm. We view this part as our principal theoretical contribution. This investigation calls 
for inspection of the convex geometry of C, with particular attention to its faces and their 
dimension. To this effect, we use results by Pataki [53] to describe the face of C which contains 
a given X in its relative interior, and we quote the lower-bound on the dimension of that face 
as a function of rank(A). We further argue that this bound is almost always tight, and we give 
an essentially tight upper bound on the dimension of a face, generalizing a result of Laurent 
and Poljak [42] to d > 1. 

Using this facial description of C, we establish that for strongly concave /, for p > p* (3), 
all second-order critical points of (RP p ) reveal KKT points. Also, for concave /, we show the 
same for p > (Corollary 3.16), and argue that p > p* is sufficient under an additional 

condition we believe to be mild. Hence, 

For linear f, above a certain threshold for p, all second-order critical points of (RP p ) 
are global optimizers. 

The condition is stronger than the one proposed in [20] , and the statement is about second- 
order critical points, rather than about local optimizers of (RP p ). There are no similar results 
for convex /, as then solutions can have any rank. 

We close the paper with numerical experiments showing the efficiency of the staircase algo¬ 
rithm to solve (P) on certain synchronization problems involving rotations and permutations, 
as compared to IPM’s and SDPLR. 

Note that, up to a linear change of variable, problem (P) also encompasses constraints of 
the form Xu = Bi where each Bi is positive definite. We assume all diagonal blocks have 
identical size d as this simplifies exposition, but the proposed method can easily accommodate 
inhomogeneous sizes, and many of the developments go through for complex matrices as well. 

1.2 Applications 

Problem (RP p ) and its relaxation (P) appear in numerous applications. Many of those belong 
to the class of synchronization problems, which consist in estimating group elements from 
measurements of pairwise ratios. Further applications are also described, e.g., in [65, 49, 9]. 

Combinatorial problems can be modeled in (RP p ) with d = p = 1. A seminal example 
is Max-Cut: the problem of clustering a graph in two classes, so as to maximize the sum 
of weights of edges joining the two classes. The cost / is linear, determined by the graph’s 
adjacency matrix. Its relaxation to (P) is the subject of an influential analysis by Goemans 
and Williamson [36], which helped popularize the type of lifts considered here. See [31, 
eq. (3)] for a recent application of Max-Cut to genomics. The same setup, but with different 


5 


linear costs, appears in the stochastic block model [1], in community detection [27], 
in maximum a posteriori (MAP) inference in Markov random fields with binary 
variables and pairwise interactions [35] and in robust PC A [46, Alg. 1], All of these study 
the effects of the relaxation on the final outcome, mostly under random data models. Their 
linear cost matrices are often structured (sparse or low-rank), which is easily exploited here. 

Spherical embeddings is the general problem of estimating points on a sphere in M p , and 
appears notably in machine learning for classification [81] and in the fundamental problem 
of packing spheres on a sphere [26]. It is modeled by (RP p ) with d = l,p > 1. The 
same setup also models correlation matrix completion and approximation [37]. In the 
latter, an algorithm to solve (P) is proposed, which inspired [40], which inspired this work. 

Synchronization of rotations is the problem of estimating m rotation matrices (orthog¬ 
onal matrices with determinant 1, to exclude reflections), based on pairwise relative rota¬ 
tion measurements. It is modeled in (RP p ) with d = p > 1 (often, 2 or 3) and comes up 
in structure from motion [7], pose graph estimation [21], global registration [24], 
the generalized Procrustes problem [72] and simultaneous localization and map¬ 
ping (SLAM) [22], It serves in global camera path estimation [18], scan alignment [15, 78], 
and sensor network localization and the molecule problem [29, 30]. In many of these prob¬ 
lems, translations must be estimated as well, and it has been shown in practical contexts 
that rotations and translations are best estimated separately [22, Fig. 1], Here, (RP p ) can 
easily accommodate the determinant constraint: it comes down to picking one of the con¬ 
nected components of St (d,p) m , as in [16]. The relaxation (P) ignores this, though; see [60] 
for relaxations which explicitly model this difference (at additional computational cost). The 
problem of estimating orthogonal matrices appears notably in the noncommutative little 
Grothendieck problem [49, 9]. In the latter, the relaxation (P) with linear / is called 
Orthogonal-Cut, and its effect on (RPp) is analyzed. The same relaxation with a nons¬ 
mooth cost, for robust estimation, is proposed and analyzed in [78]. See also [8] for another 
robust formulation of the same problem, based on low-rank-plus-sparse modeling. 

The common lines problem in Cryo-EM is an important biomedical imaging instance 
of (RPp), where orthonormal matrices are to be estimated with d = 2,p = 3 [79]. 

Phase synchronization and recovery can be modeled with p = d = 2 (as phases are 
rotations in M 2 ). It is sometimes attractive to model phases as unit-modulus complex numbers 
instead, as is done in [65] for phase synchronization, with the same SDP relaxation. This can 
be used for clock synchronization. See [10] for a study of the tightness of this SDP, and [28] 
for an application to ranking. The Phase-Cut algorithm for phase recovery uses the same 
SDP [77], with a different linear cost. While not explicitly treated, many of the results in this 
paper extend to the complex case. 

1.3 Related work 

Problem (RPp) is an instance of optimization on manifolds [4]. Optimization over orthonormal 
matrices is also studied in, e.g., [34, 80]. Being equivalent to (P) with a rank constraint, (RPp) 
also falls within the scope of optimization over matrices with bounded rank [61, 47], where 


6 


the latter is also an extension of [40]. The particular case of optimization over bounded-rank 
positive semidefinite matrices with linear constraints was already addressed in [71] . The same 
without positive semidefiniteness constraint is studied recently in [44] , also with a discussion of 
global optimality of second-order critical points. With a linear cost /, problem (RP p ) (which 
then has a quadratic cost g) is a subclass of quadratically constrained quadratic programming 
(QCQP). QCQP’s and their SDP relaxations have been extensively studied, notably in [50, 67], 
with particular attention to approximation ratios. For (P), these approximation ratios can be 
found in [9]. 

In part owing to the success of (P) with linear / in adequately solving a myriad of hard 
problems, there has been strong interest in developing fast, large-scale SDP solvers. The 
present paper is one example of such a solver, restricted to the class of problems (P). SDPLR 
is a more generic such solver [19, 20]. See also [75] for a review, and [32] for a recent low- 
complexity example with precise convergence results, but which does not handle constraints. 

Much of this paper is concerned with characterizing the rank of solutions of (P), especially 
with respect to how large p must be allowed to grow in (RP p ) to solve (P). There is also 
considerable value in determining under what conditions (P) admits solutions of the desired 
rank for a specific application, that is: when is the relaxation tight? This question is partially 
answered in [10] for the closely related phase synchronization problem, under a stochastic 
model for the data. See [1] for a proof in the stochastic block model, and [6] for a study 
of phase transitions in random convex programs. There also exist deterministic tightness 
results, typically relying on special structure in a graph underlying the problem data. See for 
example [68, 60, 58]. See also Appendix A for a deterministic proof of tightness in the case of 
single-cycle synchronization of rotations. The proof rests on the availability of a closed-form 
expression for the dual matrix S (21), and for the solution to be certified. With the same 
ingredients, it is easy to show, for example, that (P) is tight for Max-Cut when the graph is 
bipartite. 

Semidefinite relaxations in the form of (P) with additional constraints have also appeared 
in the literature. In particular, this occurs in estimation of rotations, with explicit care for 
the determinant constraints: Saunderson et al. [60] explicitly constrain off-diagonal blocks to 
belong to the convex hull of the rotation group; this is not necessary for the orthogonal group— 
see Proposition 2.1. Similarly, for synchronization of permutations in joint shape matching, 
off-diagonal blocks are restricted to be doubly stochastic [25, 38]. Finally, in recent work, 
Bandeira et al. [11] study a more powerful class of synchronization problems with additional 
linear constraints of various forms. An example with an additional nonlinear constraint 
appears in [78], which imposes an upperbound on the spectral norm of X. All of these are 
motivation to generalize the framework studied here, in future work. 

We mention in passing that the MaxBet and MaxDiff problems [73] do not fall within the 
scope of this paper. Indeed, although they also involve estimating orthonormal matrices as 
in (RP p ), their cost function has a different type of invariance, which would also lead to a 
different type of relaxation. 

1.4 Notation 

The size parameters obey l<d<p<n= md. Matrices A E M nxn are thought of as block 
matrices with blocks of size d X d. Subscript indexing such as refers to the block on the 
ith row and jth column of blocks, 1 < i, j < m. For Z E M nxp , Z t refers to the ith slice of size 
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d x p, 1 < i < m. The Kronecker product is written <8> and vec vectorizes a matrix by stacking 
its columns on top of each other. A real number a is rounded down as |_aj • The operator 
norm ||A|| op = U m ay (A) is the largest singular value of a matrix, and its Frobenius norm ||A||f 
is the f^-norm of vec(A). S nxn is the set of symmetric matrices of size n, and A Y 0 means 
A E § nxn is positive semidefinite. syrn(A) = {A + A T )/2 extracts the symmetric part of a 
matrix. 0(d) is the group of orthogonal matrices of size d. ker£ denotes the null-space, or 
kernel, of a linear operator. 

2 Geometry 

Both search spaces of (RP p ) and (P) enjoy rich geometry, which leads to efficient analytical 
and numerical tools for the study of these optimization problems. The former is a smooth 
Riemannian manifold, while the latter is a compact convex set. Figure 1 depicts the two. 

2.1 Smooth geometry of the rank-restricted search space 

Endow M nxp with the classical Euclidean metric (Ui,U 2 ) = trace (UjU 2 ), corresponding to 
the Frobenius norm: HGIf = (U,U). We view the search space of (RP p ) as a submanifold 
of R nxp and endow it with the Riemannian submanifold geometry [4], First, define a linear 
operator symblockdiag: R nxn —> § nxn which symmetrizes diagonal blocks and zeroes out all 
other blocks: 

( M ii + M 7i A — A 

symblockdiag(M)-■ = < 2 ’ (4) 

I 0 otherwise. 

This allows for a simple definition of the manifold via an equality constraint as 

St (d,p) m = jl" £ R nxp : symblockdiag(yy T ) = I n j . (5) 

The set is non-empty if p > d. It is connected if p > d. Counting dimensions yields 
dimSt(d, p) m = np — md(d + l)/2. The tangent space to St (d,p) m at Y is a subspace of 
R nxp obtained by differentiating the equality constraint: 

T y St(d,p) m = |y E R nxp : symblockdiag (YY T + YY T ^j = o} . (6) 

Among the tangent vectors are all vectors of the form Yil. for 17 E R pxp skew-symmetric: 
these correspond to “vertical directions”, in the sense that following them does not affect the 
product yy T (at first order). Each tangent space is equipped with a restriction of the metric 
(•,•}, thus making St (d,p) m a Riemannian submanifold of R nxp . The orthogonal projector 
from the embedding space R nxp to the tangent space at Y is 

Proj y (Z) = Z — symblockdiag(yy T )y. (7) 

The total computational cost of a projection is thus 0(m ■ d 2 p) = 0(ndp) flops. 

The optimization problem (RP p ) involves a function g(Y) = f(YY T ) defined over R nxp . 
Denote its classical, Euclidean gradient at Y as X7g(Y). The Riemannian gradient of g at 
y, grad< 7 (y), is defined as the unique tangent vector at Y such that, for all tangent Y , 




Figure 1: (Left) For m = 3, d = 1, the set C (2) contains all positive semidefinite matrices of 
the form X = [1, a, b\ a, 1, c; b, c, 1]. It is here represented in coordinates (a, b, c ). The interior 
of the shell contains the rank-3 matrices; the four extreme points (black dots) are the rank-1 
matrices; and the remainder of the boundary is the (smooth) set of rank-2 matrices. That 
smooth geometry breaks down at the rank-1 matrices. Note that for d > 1, extreme points of 
rank d are no longer isolated. (Right) For m = 3, d = 1 ,p = 2, the set St (d,p) m (1) parame¬ 
terizes the matrices of rank at most 2 in C (redundantly). In this case, St (d,p) m corresponds 
to a product of three circles in 2D: Y = [cos au, sin au; cos « 2 , sin cos 03 , sin 0 : 3 ] E St (d,p) m . 
One of these degrees of freedom is redundant, because the factorization of X € C as YY T is 
not unique. The figure represents the remaining degrees of freedom after fixing ai = 0. Notice 
how accepting the redundancy in the parameterization allows for a smooth representation of 
the nonsmooth set of bounded rank matrices in C. Color codes for ||-Y||f = ||TT t ||f- 
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(grad g(Y), Y) = (Vg(Y), Y). Naturally, this is given by the projection of the classical gradient 
to the tangent space [4, eq. (3.37)]: 

gradg(Y) = Proj y (Xg(Y)) = 2 Proj y ( Vf(YY T )Y^j , ( 8 ) 

where V/(X) is the classical gradient of /, a symmetric 3 matrix of size n, and we used (10). 
Furthermore, denote by X 2 g(Y) the classical Hessian of g at Y. This is a symmetric operator 
on M nxp . The Riemannian Hessian of g at Y is a symmetric operator on the tangent space 
at Y obtained as the projection of the derivative of the gradient vector field [4, eq. (5.15)]: 

Hess g(Y) [Y] = Proj y (d(T ^ Proj y (Vg(Y ))) (F)[F]) 

= Proj y (x 2 g(Y ) [Y] - symblockdiag (V 5 (y )T t ) y') , (9) 

where D denotes a classical directional derivative and we used Proj y o Proj y = Proj y . For 
future reference, we note these expressions of the derivatives of g in terms of those of /: 

Vg(Y) = 2V/P0Y, and (10) 

X 2 g(Y)[Y] = 2 (v 2 f{X)[X]Y + V/(X)y) , with X = YY T +YY T . (11) 

Optimization algorithms on Riemannian manifolds typically are iterative. As such, they 
require a means of moving away from a point Y along a prescribed tangent direction Y. to 
reach a new point on the manifold: the next iterate. Since Y + Y does not, in general, belong 
to the manifold, extra operations are required. Retractions achieve exactly this [4, §4.1], One 
possible retraction for (1) is as follows. For each d x p “slice” i in {1,... , m}, 

(detraction}- (y)) = U^, with y + Y t = UfoV?, (12) 

where is a thin singular value decomposition of the zth slice Yj + Y t . This retraction 

projects each slice of Y + Y to the closest orthonormal matrix. Consequently, this is even a 
second-order retraction [2]. The total cost of computing a retraction is 0(m ■ ( p 2 d + d 3 )) = 
0(np 2 ) flops. 

2.2 Convex geometry of the full search space 

The optimization problem (P) is defined over the compact convex set 

C = {X € § nxn :I80 and X lt = I d for * = 1... m}. 

For d = 1. this is the elliptope , or set of correlation matrices [42]. Often, we hope to recover 
matrices X in C such that X has rank d, or such that off-diagonal blocks X^j are orthogonal. 
The following proposition shows that these two considerations are equivalent, and that the 
convex relaxation leading to C is tight in that sense (the tightest relaxation would consider 
the convex hull of rank-d matrices in C, but this is difficult to handle 4 ). 

3 V f(X) is symmetric because / is formally defined over the symmetric matrices. If the gradient of / over 
the square matrices is not symmetric, V/(A') is obtained by extracting its symmetric part. 

4 Let C be the convex hull of rank-d matrices in C. The extreme points of C are these matrices [56, Cor. 18.3.1]. 
Thus, optimizing a linear cost function over C solves (RPd), which is NP-hard. Hence, there probably does 
exist an efficient representation of C. 
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Proposition 2.1. For all X E C, all blocks Xij are in the convex hull of 0(d), that is, 
o’ma.x(Xij) < 1. Furthermore, rank(A') = d if and only if X^ E 0(d) for all i,j. 

Proof. Since X is positive semidefinite, for all i j, the submatrix formed by the blocks 
Xu, X^, Xji and Xjj is positive semidefinite. By Schur, this holds if and only if Xj-X- < Ij, 
which in turn happens if and only if all singular values of Xjj are at most 1. The set of such 
matrices is the convex hull of all orthogonal matrices of size d [59]. 

Consider Y E St (d,p) m such that rank(A') = p and X = YY T . Clearly, if p = d, 
X^ = YfYj is orthogonal, since all Y^’s are orthogonal. Conversely, if X^ is orthogonal, then 
the rows of Y t and Yj span the same subspace. Indeed, YfYjY-Y^ = Id- Multiply by Y % on 
the right. Now notice that YjY k is an orthogonal projector onto the subspace spanned by the 
rows of Yj,. Since Y t remains unaffected by such a projection first on the subspace of Yj then 
again on the subspace of Y i} they must span the same subspace. Hence, fixing j = 1, for each 
i, there exists Qi E 0(d) such that Y % = Q t Y\. Finally, Y = diag (la, Q 2 , ■ ■ ■, Qm)(lmxi <8> Y\) 
(where <g> is the Kronecker product), which confirms that Y and X have rank d. Notice that 
this proof further shows that X has rank d if and only if there is a spanning tree of edges 
(i,j) on an m-nodes graph such that the X t j's are orthogonal (in which case they are all 
orthogonal). □ 

The set C may be decomposed into faces of various dimensions. 

Definition 2.1 (faces, §18 in [56]). A face of C is a convex subset T of C such that every 
(closed) line segment in C with a relative interior point in T has both endpoints in T. The 
empty set and C itself are faces of C. 

By [56, Thm. 18.2], the collection of relative interiors 5 of the non-empty faces forms a 
partition of C. That is, each X E C is in the relative interior of exactly one face of C, called 
Fx ■ Furthermore, all faces of C are exposed [55, Cor. 1], that is, for every face F, there exists 
a linear function / such that F is the set of solutions of (P). Of particular interest are the 
zero-dimensional faces of C (singletons), called its extreme points. 

Definition 2.2 (Extreme and exposed points). X E C is an extreme point of C if there does 

not exist X',X" E C\{A} and 0 < A < 1 such that X = XX' + (1 — X)X". X is an exposed 

point of C if there exists C such that X is the unique maximizer of (C, X) in C. 

In other words, X is extreme if it does not lie on an open line segment included in C. Since 
C is compact, it is the convex hull of its extreme points [56, Cor. 18.5.1]. Extreme points are of 
interest notably because they often arise as the solution of optimization problems. Specifically, 
if / is a concave function (in particular, if / is linear), then / attains its minimum on C at 
one of its extreme points [56, Cor. 32.3.2], 

Following the construction in the proof of [53, Thm. 2.1], given Y E M nxp of full rank such 
that X = YY t (rank(A) = p), we find that 

Fx = |A” = Y(I p + A)Y t : A E ker C\ and I p + A F 0 j , with (13) 

Cx ■ s pxp (S dxd r: A i-> C X (A) = ( Y,AY L T , • • • , Y m AY^ . (14) 

5 The relative interior of a singleton is the singleton. 
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The dimension of J~x is the dimension of the kernel of Cx • The rank-nullity theorem gives a 
lowerbound (see also Theorem 3.15 for an upperbound): 

dim = pip f 11 - rank Cx > 11 - m d(d f 11 = A. (15) 

It follows that extreme points A (i.e., points such that dim Ax = 0) have small rank: 

d < rank(A') < p* := + 4 md(d + 1) — 1^ /2. (16) 

Note that A > 0 when p > p*. 

Remark 2.2. For linear f, (P) admits an extreme point as global optimizer, so that (RP p ) 
and (P) have the same optimal value as soon as p > p* (16). In other words: for linear 
f, (RP p ) is not NP-hard if p > p*. 

Not all feasible A’s with rank as in (16) are extreme. For example, setting d = 1 and 
m > 3 as in Figure 1, select two distinct, admissible matrices of rank 1, Xq and X\. For all 
0 < A < 1, the matrix X\ = \X\ + (1 — A)Ao, lying on the open line segment between Ao and 
X\, is admissible and has rank 2. Thus, X\ satisfies (16), but it is not an extreme point, by 
construction. Notwithstanding, the expectation that Cx is generically of full rank suggests 
that almost all feasible A’s satisfying (16) should be extreme; an intuition that is supported 
by Figure 1. More generally, in Theorem B.l, we prove for d = 1 that dim Ax = A for almost 
all X of rank p. 

Many applications look for solutions of rank d. All A’s of rank d are exposed (hence 
extreme), meaning they can all be recovered as unique solutions of (P). 

Proposition 2.3. For all X £ C, ||A||| < m 2 d. Furthermore, ||A||| = m 2 d if and only if 
rank(A) = d. In particular, each X e C of rank d is an exposed extreme point of C. 

Proof. Let o\(X t j) > ■■■ > cr^Xfj) denote the singular values of Xij. By Proposition 2.1, 

1 7k(Xij ) < 1 for all i,j,k. Hence, 

m m d 

|j AH = £ \\Xij ||| = £ J2°k(Xij) < m 2 d. (17) 

i,j =1 i,j =1 fc=l 

The upperbound is attained if and only if ak(Xij) = 1 for all i,j, k, thus, if and only all Ay’s 
are orthogonal. By Proposition 2.1, this is the case if and only if rank(A) = d. Now consider 
X has rank d. We show it is exposed (and hence extreme): 

max (A, A) < max (A, A) < ||A||f • max ||A||f = m 2 d. (18) 

X£C II A||p<m 2 d I! A||p<m 2 d 

The second inequality follows by Cauchy-Schwarz, and equality is attained if and only if 
A = X , which is in C. Thus, both max problems admit A as unique solution, confirming 
that A is an exposed extreme point. □ 

Remark 2.4. Proposition 2.3 is an extension of [45, Thm. 1] to the case d > 1. We note 
that the proof in that reference does not generalize to d > 1. 
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3 From second-order critical points Y to KKT points X 


In this section, we show that rank-deficient second-order critical points Y of (RP p ) yield KKT 
points X = YY t of (P). Furthermore, when Y is second-order critical but X is not KKT, 
it is shown how to escape the saddle point by increasing p. If p increases all the way to n, 
then all second-order critical points reveal KKT points. The proofs parallel those in [40]. The 
main novelty is explicit bounds on p such that all second-order critical points of (RP p ) reveal 
KKT points. The proofs bring us to consider the facial structure of C. 

A key ingredient for all proofs in this section is the availability of an explicit matrix 
S(X) (21) which is positive semidefinite if and only if X is KKT (Theorem 3.3). The for¬ 
mula for S is simply read off from the first-order optimality conditions of (RP p ), owing to 
smoothness of the latter. 

Lemma 3.1 (Necessary optimality conditions for (P)). X G C is called a KKT point for (P) 
if there exist a symmetric matrix S G § nxn and a symmetric, block-diagonal matrix A G § nxn 
(dual variables) such that 

SX = 0, S = Vf(X) + A, and S Y 0. 

If X is a local optimizer for (P), then X is a KKT point. If f is convex, all KKT points are 
global optimizers. 

Proof. Apply Theorems 3.25 and 3.34, and Example 3.36 in [57]. KKT conditions are neces¬ 
sary since Slater’s condition holds: I n is feasible for (P) and it is strictly positive definite. □ 

Lemma 3.2 (Necessary optimality conditions for (RP p )). Let Y G St (d,p) m and X = YY T . 
A critical point of (RP p ) satisfies grad g(Y) = 0, that is, 

(V/(X) - symblockdiag(V f(X)X) ) Y = 0. (19) 

A second-order critical point is a critical point which satisfies Hess g(Y) — that is, 

for all Y G TySt {d,p) m , (y, V 2 g(Y)[Y] - symblockdiag(Vc/(K)K T ) Y ^ > 0 . ( 20 ) 

IfY is a local optimizer for (RP p ), then it is a second-order critical point. 

Proof. This is a direct generalization of the classical necessary optimality conditions for un¬ 
constrained optimization [14, Prop. 1.1.1] to the Riemannian setting, as per the formalism 
in [4, 82], Use equations (7), ( 8 ) and (9) and the fact that Projy is self-adjoint to obtain 
equations (19) and (20). □ 

Lemmas 3.1 and 3.2 suggest the definition of an (as yet merely tentative) formula for the 
dual certificate S, based on (19): 

5 = S(X) = V/p0 - symblockdiag(V/(A)A). ( 21 ) 

Indeed, for any critical point Y of (RP p ), it holds (with X = YY T ) that SY = 0, so that 
SX = 0. In that case, for p = d, S can be advantageously interpreted as a graph Laplacian 
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(up to a change of variable 6 ), as is often the case for dual certificates of estimation problems 
on graphs [10, 33]. We now show that S(X) is indeed the unique possible dual certificate for 
any feasible X. (This is similar to, but different from, Theorem 4 in [40]; a complex version 
appears in [10] for d = 1.) 

Theorem 3.3 (S is the right certificate). X £ C is a KKT point for (P) if and only if S (21) 
is positive semidefinite. If so, S = S is the unique dual certificate for Lemma 3.1. 

Proof. We show the if and only if parts of the first statement separately. 

<=: By construction, trace(SA) = 0. Since both S and X are positive semidefinite, this 
implies SX = 0. Apply Lemma 3.1 with S = S and A = — symblockdiag(V/(A)A). 

=>: Since A is a KKT point for (P), there exist 5^0 and A symmetric, block-diagonal 
satisfying the conditions in Lemma 3.1. In particular, SX = 0 and V/(A) = S — A. 

Thus, V/(A)A = —AX and svmblockdiag(V / (A) A) = — symblockdiag^AA^ = —A. 

Here, we used both the fact that A is symmetric, block-diagonal and the fact that 
Xu = Id- Consequently, S = V/(A) — symblockdiag(V/(A)A) = S — A + A = SP0. 

The last point also shows there exists only one pair (S', A) certifying A is a KKT point. □ 

Notice how the Riemannian structure underlying problem (P) made it possible to simply 
read off an analytical expression for a dual certificate from the necessary optimality conditions 
of (RP p ). This smooth geometry also leads to uniqueness of the dual certificate (this is 
connected to nondegeneracy [5, Thm. 7]). Theorem 3.3 makes for an unusually comfortable 
situation and will be helpful throughout the paper. 

For convex /, we can make the following statement regarding uniqueness of the solution. 

Theorem 3.4. Assume f is convex. If X £ C is an extreme point for (P) (which is true in 
particular if rank(A) = d), and S P 0, and rank(A) + rank (S') = n (strict complementarity), 
then X is the unique global optimizer of (P). 

Proof. From Theorem 3.3, it is clear that A is a global optimizer. We prove by contradiction 
that it is unique. Let X' / A be another global optimizer. Since (P) is a convex problem in 
this setting, / is constant over the whole (optimal) segment 1 H > X + 1{X' — A) for t E [0,1]. 
Hence, the directional derivative of / at A along A = X' — X is zero: 

0 = (V/(A), A) 

= (S + symblockdiag(V/(A)A), A) (definition of S (21)) 

= (S, X) (diagonal blocks of X are zero) 

= {S, X') (SX = 0). 

®For Y a critical point of (RPp) with p = d, we have Y = ( y, T • y7) T with orthogonal Yds, and SY = 0. 
Write V/(yy T ) := — C for short. SY = 0 implies, after some algebra, that the (CYY T )u’s are symmetric. 
This can be used to see that S = diag(Yi,..., Y m ) T 5(yy T ) diagfli,..., Y m ) is a matrix with off-diagonal 
blocks equal to —Y^CijYj and diagonal blocks equal to Y, Cj^Yj. Thus, S is exactly the Laplacian of 
the graph with m nodes and edge “weights” given by the matrices Y.f C,., K,. 
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Since both S and X' are positive semidefinite, it ensues that SX' = 0. (Note that for linear 
/, this shows S is the dual certificate for all global optimizers of (P), not only for X.) Hence, 
SX = 0. 

Let p = rank(Jf) and Y E M nxp be a full-rank matrix such that X = YY T . Strict comple¬ 
mentarity and SX = 0 imply the columns of Y form a basis for the kernel of S. Hence, SX = 0 
ensures that X = YAY r for some A E E> pxp , Afi=- 0, such that symblockdiag(yHK T ) = 0. In 
other words, X' = X + X = Y(I p + A)Y^ is in the face J~x- This is a contradiction, since 

Xx = {-V}. fi 

Remark 3.5. In general, the condition that X be an extreme point in the previous theorem 
cannot be removed. Indeed, if f(X) = 0, then all admissible X’s are globally optimal and 
S(X ) = 0^0. In particular, X = I n satisfies strict complementarity, but if m > 1, it is not 
extreme, and it is not a unique global optimizer. Likewise, any rank-d admissible X is extreme 
and globally optimal, but does not satisfy strict complementarity. Similar examples can be built 
with nonzero linear costs f(X) = (C,X), where the sparsity pattern of C corresponds to a 
disconnected graph. 

Conversely, it is not true in general that uniqueness of the global optimizer implies ex¬ 
tremeness or strict complementarity. Simply consider f(X) = ||X — Xo||p with Xq E C: the 
global optimizer X = Xq is unique, andVf(X o) = 0, so that S(Xq) = 0. Form large enough, 
Xq can be chosen to be both not extreme and rank deficient. For an illustration of this in 
semidefinite programs, see the nice example after Theorem 10 in [5]. 

Continuing with convex cost functions, KKT points of (P) coincide with global optimizers. 
This and the fact that (RP p ) is a relaxation of (P) lead to the following summary regarding 
global optimality conditions. For (RP p ), these sufficient conditions are conclusive whenever 
the relaxation is tight. 

Corollary 3.6 (Global optimality conditions). Assume f is convex and let Y E St (d,p) m ; 
X = YY r is globally optimal for (P) if and only if S Y 0. If so, then Y is a global optimizer for 
the nonconvex problem (RP p ). Furthermore, if X is extreme (in particular, if rank(X) = d) 
and if rank(X) + rank(S') = n, then X is the unique global optimizer of (P) and Y is the 
unique global optimizer of (RP p ), up to orthogonal action YQ, Q E O (p). 

Returning to the general case of / not necessarily convex, we establish links between 
second-order critical points of (RP p ) and KKT points of (P). In doing so, it is useful to 
reformulate the second-order optimality condition (20) on (RP p ) in terms of S (21) and /. 
Let Y E St {d,p) m and X = KK T . Then, Y is a second-order critical point for (RP p ) if 
and only if it is critical and, for all Y E TySt {d,p) m , with X = YY T + YY T , it holds that 
(using (11)): 

<Jy,Hess 5 (K)[y]^ = (Y,V 2 g{Y)[Y] - symblockdiag(Vc/(y)y T ) Y^ 

= 2 (Y, V 2 f(X)[X]y ' + Xf(X)Y - symblockdiag(V f(X)X) Y) 

= 2(Y, V 2 f(X)[X]Y + SY) 

= (X,X7 2 f(X)[X]) + 2(Y,SY) >0. (22) 

Since (RP p ) is essentially equivalent to (P) with the additional constraint rank(X) < p, 
assuming Y is optimal for (RP p ), we expect X to be a KKT point at least if either of the 
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following holds: (1) if Y is rank deficient, since then the extra constraint is not active, meaning 
it is “as if” we were solving (P); or (2) if p = n, since then the extra constraint is vacuous. 
The two following theorems show this still holds for second-order critical points Y. 

Theorem 3.7. IfY is a rank-deficient, second-order critical point for (RP p ), then X = YY T 
is a KKT point for (P). 

Proof. By Theorem 3.3, we must show that S (21) is positive semidefinite. Since Y is rank 
deficient, there exists zGP such that z / 0 and Yz = 0. Furthermore, for all x G M n , the 
matrix Y = xz T is such that YY T = 0. In particular, Y is a tangent vector at Y ( 6 ). Since 
Y is second-order critical, inequality (22) holds, and here simplifies to: 

(Y, SY) = (xz T , Sxz T ) = ll^ll 2 • x T Sx > 0. 

This holds for all x G M n . Thus, S is positive semidefinite. □ 

Theorem 3.8. IfY is square (p = n) and it is a second-order critical point for (RP n ), then 
X = YY' t is a KKT point for (P). If Y is ftdl-rank, it needs only be first-order critical for 
X to be a KKT point. 

Proof. If Y is rank deficient, then the result follows from Theorem 3.7. If Y is full rank, then 
it is invertible. Since Y is also a critical point, first-order optimality conditions (19) imply 
SY = 0, hence S = 0. This completes the proof, as per Theorem 3.3. □ 

In particular, if / is convex and (P) has a unique solution of rank r, then all second-order 
critical points of (RP p ) have rank either r or p. Thus, if p is larger than the rank of a solution 
of (P), we may hope that minimizing (RPp) until we reach a second-order critical point will 
result in a rank-deficient Y, revealing a KKT point. Unfortunately, in general, we cannot 
guarantee rank deficiency beforehand. For those cases, the following theorem and corollary 
provide a means of escaping unsatisfactory critical points. 

Theorem 3.9 (Escape direction (rank-deficient)). Let Y be a rank-deficient critical point 
for (RP p ) such that X = YY T is not a KKT point of (P). Then, for all nonzero vectors 
z G M. p and u G M n such that Yz = 0 and u T Su < 0 (21), Y = uz T G TySt(d,p) m is a descent 
direction for g from Y. 

Proof. By Theorem 3.3, u exists because X is not a KKT point. Since P'K T = 0, Y is indeed 
tangent at Y ( 6 ). From (22), it follows that 

(Y, Hess g(Y)[Y}) = 2 (Y, SY) = 2\\z\\ 2 • u T Su < 0. 

As a result, this truncated Taylor expansion holds (we use the fact that the retraction ( 12 ) is 
second-order and assume ||z|| = 1 ):‘ 

4>(t) := g(Retractiony (tY)) = g(Y ) + ( u T Su)t 2 + 0(t 4 ). 

Since u T Su < 0, there exists to > 0 such that <f(t) < fi(0) for all t in ]0,fo[- 0 

7 We remark that the residue is of order 4 rather than 3, since (j> is an even function of t. Indeed, let 
J — diag(l, •.., 1, —1) € O(p) and consider the SVD of Y = U EF , such that the last column of V is 
z/\\z\\. Then, Y - tY = (YV - tYV)V T = (Y + tY)VJV r , because the last column of YV is zero, and YV’s 
only nonzero column is the last one. Hence, Retraction}-( — tY) = Retractiony (tY)VJV T . These yield the 
same matrix A', thus the same value of g. If the residue is bounded by it 4 , then t = ^/—u T Su/2L ensures 
<j>{t) < m — {u 1 Su) 2 /AL. In particular, for f(X) = (C,X), the fourth-order term follows and could be used 
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Corollary 3.10 (Escape direction (full-rank)). Let Y be a full-rank critical point for (RP p ) 
such that X = YY T is not a KKT point of (P). Let Y + = (Y 0 nx ( p+ -p)) £ St (d,p+) m . 
Then, (a) g(Y + ) = g(Y); (b) Y + is a critical point for (RP p+ ); and (c) for all uGl" such 
that u T Su < 0, Y = ue y+i ® Ty + St(d, p+) m is a descent direction for g from Y + , where 
e p +i G M p+ is a zero vector except for its (p + l) st entry, equal to 1. 

Proof. Since Y + Yj = YY r , Y + is indeed feasible for (RP p+ ) and g(Y + ) = g(Y). Since Y is 
critical, SY = 0, so SY+ = 0: Y + is a critical point. The rest follows from Theorem 3.9. D 

Later in this section, we show how to escape full-rank points without increasing the rank, 
under additional assumptions—see Proposition 3.17. 

An important question remains: for moderate p, should we expect to encounter second- 
order critical points Y that do not correspond to KKT points of (P)? We provide partial 
answers for concave / below. In particular, this covers the important case of linear costs. The 
stronger results do not include strictly convex functions, as for these (P) can have solutions 
of arbitrary rank. 

The result below shows that Y can only be a critical point of (RP p ) if YY T is a critical 
point for (P) restricted to the face Tx', and similarly for second-order critical points. This 
brings a useful corollary. 

Lemma 3.11. Let Y G St (d,p) m and X = YY T G C. IfY is a critical point (resp., a second- 
order critical point) of (RP p ), then X is a critical point (resp., a second-order critical point) 
o/min^-gj. f(X), where Tx (13) is the face of C which contains X in its relative interior. 

Proof. If Y is first-order critical, then SY = 0 (21). This implies that V/(X) is orthogonal 
to all directions X = YY T +YY T for Y G TySt(d,p) m . Indeed, symblockdiag(X) = 0, hence 

(Vf(X),X) = (S,X) = 0. 

(The subspace spanned by all such X’s has dimension np — _ EiL_Jd_) From (13), ob¬ 

serve that X is parallel to the face J~x iff X = YAY T for A G E, pxp with symblockdiag(YAY T ) = 
0. Thus, V/(X) is, in particular, orthogonal to Tx- This shows X is a first-order critical 
point for min^^ /(X), since X is in the relative interior of the face. Further consider (22) 
for second-order critical Y and all X = YAY T . Since Y = \YA and SY = 0, it follows that 

o < (X,V 2 /(X)[X]) + 2(Y,SY) = (X,V 2 /(X)[X]). 

Thus, X is a second-order critical point for the face-restricted optimization problem. Ed 

The following corollary can be put in perspective with [20, Thm. 3.4], The latter states 
a similar result for (P) with general linear equality constraints, for linear /. Their result 
characterizes local optimizers of (RP p ), whereas the following result characterizes first- and 
second-order critical points (computationally more manageable objects). 

to estimate L, to be used in a line-search for the escape: 

jt 4 ■ [(C, AX A) + u T D(3symblockdiag(CX) - 4 C)w] + 0(f), 

with A = symblockdiag(uM T ) and D = diag(||ui|| 2 ,..., ||um|| 2 ) ® Id- 
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Corollary 3.12. Theorem 3.7 and Lemma 3.11 imply the following, for Y E St (d,p) m and 
X = YY T € C. 


• If f is linear and Y is critical, then f is constant over Fx- If furthermore Y is second- 
order critical and p > |_ P*\ (16), then either X is globally optimal for (P), or (Y has 
full rank and) the face Fx has positive dimension (15) and is suboptimal. 

• If f is strongly concave and Y is second-order critical, then X is an extreme point 
(since X = YAY T ± 0 =► {Y ,Hess g(Y)[Y}) = (X, V 2 f(X)[X}) < 0, a contradiction). 
In particular, all second-order critical points of (RP p ) have rank at most [p*J, and 
p > |_p*J => X is KKT (since Y must be rank-deficient). 

• If f is convex (resp., strictly convex) and Y is critical, then X is optimal (resp., the 
unique optimizer) for (P) restricted to Fx- 


For linear /, Corollary 3.12 is not quite sufficient to determine how large p must be to ex¬ 
clude “bad” second-order critical points. Paraphrasing the comment following [20, Thm. 3.4], 
the latter showed that, for linear / and p > p* (essentially), local optima of (RP p ) are global 
optima, with the caveat that positive-dimensional faces (over which f must be constant) may 
harbor non-global local optima. In the literature, this has sometimes been quoted as saying 
that local optima are global optima if / is not constant over any proper face of C [see, e.g., 
46, footnote 3], but there is no indication that this is a mild condition. 8 

We thus set out to further refine the implications of second-order criticality of Y. We do 
so by leveraging the tight relationship (22) between the Hessian of the cost on (RP p ) and the 
dual certificate S. 


Theorem 3.13. Assume f is concave. Let Y E St (d,p) m be a second-order critical point 
for (RP p ). The matrix X = YY T belongs to the relative interior of the face Fx (13). IfY is 
rank-deficient, then S Y 0. IfY is full-rank, then S has at most 


dim Fx — A 

p 


(23) 


negative eigenvalues. Hence, if X is not a KKT point for (P), then it has rankp and dim Fx > 
A +p. 


Proof. Given Theorems 3.3 and 3.7, we need only focus on full-rank T’s. Since Y is second- 
order critical, inequality (22) holds. Since furthermore / is concave, (X, \/ 2 f(X)[X]) < 0 for 
all X, so that (Y,SY) > 0 for all Y E TySt(d,p) m . Let k = dimSt(d,p) m and U E R npxk , 
U T U = F denote an orthonormal basis of the space spanned by the vectorized tangent 
vectors vec(T). Then, U T (I p <S> S)U is positive semidefinite (for linear /, it inherits the 
spectrum of ^Hessg'(T)). On the other hand, since Y is a critical point, SY = 0. Let 
V E M n P x P 2 , V T V = I p 2 denote an orthonormal basis of the space spanned by the vectors 
vec (YR) for R E M pxp . Clearly, (I p (g) S)V = 0. Let k' denote the dimension of the space 

8 In fact, we found in numerical experiments (not reported) that, for d = 1, A = 1 (thus, almost all faces 
have dimension 1 and p > p*) and a random linear cost (C, A'), we could easily find a face J~x of dimension 1 
over which the cost is constant but not optimal. 
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spanned by the columns of both U and V, and let W E M. npxk ', W T W = Ik> be an orthonormal 
basis for this space. It follows that M = W T (I p < 8 > S)W is positive semidefinite. 

Let Ao < • • • < A n _i denote the eigenvalues of S. Likewise, Ao < • • • < A np _i denote the 
eigenvalues of I p <8) S. These are simply the eigenvalues of S , each repeated p times, thus: 
A i = Au/pj. Let p,Q < ■■■ < pk'-i denote the eigenvalues of M. The Cauchy interlacing 
theorem states that, for all i, 


A i < ' Pi ^ A i+np—k'. 


(24) 


In particular, since MY 0, we have 0 < po < \{ n p-k')/p\- It remains to determine kl. 

From Section 2.1, recall that k = np — md{d + l)/2. We now investigate how many 
new dimensions V adds to U. All matrices R E M pxp admit a unique decomposition as 
R = R s kew + RkerC + -^(ker C) x i where R s kew is skew-symmetric, RkeiC is in the kernel of 
Lx (14) and R(kerC) ± i s i n the orthogonal complement of the latter in § pxp . Clearly, YR s k ew 
and Ti?ker£ are tangent vectors, thus vectorized versions of these are already in the span of 
U. On the other hand, by definition, YR^kerC) 1 - i s n °t tangent at Y (if it is nonzero). This 
raises k' (the rank of W) to: 

,/ , , 2 P(P-!) j. 7 - d{d+ 1 ) p(p+ 1 ) 

k =k + p ---dim Yx = np — m --- 1 ---dim rx ■ (25) 

Combine with ^Unp-k')/p\ — 0 an( I the definition of A (15) to conclude. Q 

Theorem 3.13 is particularly meaningful for linear /, considering the intuition that for 
p > p*, generically, dim J~x = A > 0 (15) (Theorem B.l gives a proof for d = 1). Thus, 
for such p, a second-order critical point is either globally optimal, or it maps to a face of 
abnormally high dimension, over which / must be constant and suboptimal. We could not 
produce an example of the latter. We summarize this in a corollary, followed by a question. 

Corollary 3.14. Assume f is linear and fix p > p* , hence A > 0. If Y E St(d,p) m is a 
second-order critical point for (RP p ) but X = YY T is not a global optimizer for (P), then 
rank(X) = p, dim Tx > A + p and f is constant over Tx- 

Proof. Use Corollary 3.12, Theorem 3.13 and Theorem 3.7. □ 


For linear f(X) = (C,X) and p > p *, the question is the following: if C is sampled 
uniformly at random from the unit-norm symmetric matrices, what is the probability that / 
is constant over a face Tx of dimension A + p or larger, with rank(X) = pi If it is zero, 
then almost surely all second-order critical points of (RP p ) are global optimizers. We do not 
answer this question here, but refer to Theorem B.l to argue that there are few such faces. 

Theorem 3.13 is motivation to investigate upper-bounds on the dimensions of faces of C. 
The following result extends [42, Thm. 3.1 (i)] to d > 1. 

Theorem 3.15. If X E C has rank p, then the face Tx (13) has dimension bounded as: 


p(p+ 1 ) d + 1 
- n - 


< dim Tx < 


p(p + 1 ) d +1 


P~ 


If p is an integer multiple of d, the upperbound is attained for some X. 


(26) 
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Proof. Inequality (15) covers the lower bound. It remains to show that Cx{A) = 0 imposes 
at least p(d + l)/2 linearly independent constraints on A £ § pxp . Let Y £ St (d,p) m be such 
that X = YY t . and let yi ,..., y n £ M p denote the rows of Y, transposed. Greedily select p 
linearly independent rows of Y , in order, such that row i is picked iff it is linearly independent 
from rows y\ to yi-i- This is always possible since Y has full rank. Write t = {H < ■ ■ ■ < t p } 
to denote the indices of selected rows. Write s k = {((fc — 1 )d + 1),..., kd} to denote the 
indices of rows in slice Y k , and let c k = s k n t be the indices of selected rows in that slice. 

For xi ,..., x p £ M p linearly independent, the p(p + 1)/2 symmetric matrices xpxj + XjxJ 
form a basis of S pxp —see for example [42, Lem. 2.1]. Defining E t j = ypyj + y-yj = Eji , this 
means £t = {Eti,t,, ■ £,£' = 1 ■■ - p} forms a basis of § pxp . Similarly, since each slice Y k has 
orthonormal rows, the matrices {Eij : i,j £ s k } are linearly independent. 

The constraint Cx(A) = 0 means {A, Eij) = 0 for each k and for each i. j £ s k . To 
establish the theorem, we need to extract a subset T of at least p(d+ 1)/2 of these md(d+ 1)/2 
constraint matrices, and guarantee their linear independence. To this end, let 


T = {E^ : k £ {1,. .. ,m} and i £ c k ,j £ s k }. (27) 

That is, for each slice k, T includes all constraints of that slice which involve at least one of the 
selected rows. For each slice k, there are \c k \d— \ c A\ c k\ —i) g UC h constraints — note the correction 
for double-counting the E^s where both i and j are in c k . Thus, using |ci| + ■ ■ • + |c m | = p, 
the cardinality of T is: 

m = Y ic*|d - —) = p(d + 1 /2) _ W ic t i 2 . (28) 

k =1 k =1 

We first show matrices in T are linearly independent. Then, we show |T| is large enough. 

Consider one E Vj £ T: i = te for some i (otherwise, permute i and j ) and i.j £ s k for 
some k. By construction of t, we may expand y :] in terms of the rows selected in slices 1 to 
k , i.e., yj = a jpVt t n where l k = |ci| + • • • + |cfc|. As a result, E tJ expands in the basis 

£t as follows: E t j = Yhef=i a j/'Et e ,t e ,- As noted before, E^s contributed by a same slice k 
are linearly independent. Furthermore, they expand in only a subset of the basis, namely: 
£[ k ' 1 = {E te j e , : £ k -i < £ < £ k ,£' < £ k }- For k / k£^ and £j k are disjoint. Hence, 
elements of T are linearly independent. 

It remains to lowerbound (28). To this effect, use |c*;| < d to obtain: 

12 - n 112 

c k \ < max ||x|| 2 = 

: Halloo ®|| i=p 

Indeed, the maximum is attained by making as many of the entries of x as large as possible— 
this can be verified using KKT conditions. In combination with (28), this confirms at least 
p(d+1 /2) —pd/2 = p(d+1)/2 linearly independent constraints act on A, thus upperbounding 
dim Ex ■ 

To conclude, we argue that the proposed upperbound is essentially tight. Indeed, build 
Y by repeating m times the d first rows of I p , then by replacing its p first rows with I p (to 
ensure Y is full-rank). If p/d is an integer, then exactly the p/d first slices each contribute 
d{d + l)/2 independent constraints, i.e., dim E yy t = p(p + l)/2 — p(d + l)/2. □ 



d 2 + (p - — d) < pd 
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Theorems 3.3, 3.13 and 3.15 combined give a sufficient condition on p to ensure all second- 
order critical points of (RP p ) correspond to KKT points of (P). 

Corollary 3.16. Assume f is concave and p > ^r^n. If Y € St (d,p) m is a second-order 
critical point for (RP p ), then X = YY T is a KKT point for (P). If furthermore f is linear, 
then all second-order critical points of (RP p ) are global optimizers. 

Proof. Since rank(A) < p, we have dimJ-A — A < (n — p)^4. Theorem 3.13 then gives 
5^0 (21) if (n — p)(d + 1) < 2 p, which is the case. Apply Theorem 3.3 to conclude. □ 

In particular, for the Max-Cut SDP [d = 1, / linear), this shows that computing a second- 
order critical point of (RP p ) with p = \n/2 J +1 certainly solves (P). This is an interesting 
and new result, but of course, in practice, it is desirable (and empirically sufficient) to take 
p = |yj + 1 (much smaller). In the unlikely event we would encounter a “bad” second-order 
critical point with such p, the following theorem provides an escape route (for concave /) 
which does not require increasing the rank. It proceeds by moving inside a face. 

Proposition 3.17 (in-face rank reduction). Let Y 6 St (d,p) m have full-rank, X = YY T , and 
consider the symmetric operator TL on EP xp defined by TL(A) = P' T symblockdiag(yAT T ) Y. 
TL is positive semidefinite and dim ker TL = dim Fx ■ If A E ker TL is nonzero, then X' = 
k (Ip A/A m i n (A))Y t E Fx (on the boundary) and rank(A') < p — 1. 

Proof. Recall the definitions of Fx (13) and Cx (14). All follows from TL = C* X C X , where 
C* x is the adjoint of Cx- □ 

The latter proposition suggests an explicit numerical method to compute A, by computing 
a minimal eigenvector of TL. Applying TL costs 0(m(d 2 p+p 2 d )) flops. Assuming p = |_p*J +1 = 
Q(d\fm) and that up to p(p+1)/2 applications are necessary, this brings the cost of computing 
A to 0(d 2 n 3 ) flops. 

4 The Riemannian staircase algorithm 

The above results suggest a simple algorithm to compute KKT points of (P): for some small 
value of p > d + 1, find a second-order critical point Y of (RP p ). If Y is rank deficient, 
then Theorem 3.7 guarantees X = KK T is KKT for (P). Otherwise, increase p and find a 
second-order critical point of (RP p+ ), possibly warm-starting as suggested by Corollary 3.10. 
Iterating this procedure, the worst-case scenario is when p increases all the way to n, in which 
case any second-order critical point of (RP n ) yields a KKT point of (P), as per Theorem 3.8. 
Specific results pertaining to classes of functions / limit how large p could grow. We call this 
the Riemannian Staircase, listed as Algorithm 1. Of course, the hope is that the algorithm 
returns for some small p, and in practice we find that it is often sufficient to take p just above 
the rank of a solution. 

Algorithm 1 assumes availability of a procedure RiemannianOptimization(A4, g, F 0 ), 
which returns a second-order critical point of g \ M —> M, with cost at most g(Yo). This 
assumption is discussed below. 

Inside the else- block, the augmented Yi (with additional columns of zeros) is (usually) 
a saddle point. Although the second-order procedure RiemannianOptimization should be 
able to escape it, we make this step explicit via the procedure EscapeDirection. The latter 
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Algorithm 1 Riemannian Staircase Algorithm 
1: Input: Integers d < p\ < P 2 < • • • < Pk < n\ an initial iterate Yq e St(d 1 pi) m . 

2: for i = 1... k do 

3: Yi <— RiEMANNiANOPTiMlZATiON(St(d,pi) m , g, l)_i) > Descent to 2nd order critical 

4: if i = k or rank(K) < pi then 

5: return Yi > Theorems 3.7 and 3.8 

6: else 

7 : Yi (Yj, 0 n x(pi +1 -pi)) > Augment Yi for the next rank 

8 : Z <— EsCAPEDiRECTiON(St(d,pi+i) m , g, Yi) > Corollary 3.10 + line-search 

9: Yi <— Retraction^ (Z) > Eq. (12) 

10: end if 

11: end for 


can be implemented using Corollary 3.10, which indicates how computing an eigenvector of 
S ( 21 ) associated to its smallest eigenvalue, combined with a line-search, allows to escape the 
saddle with strict cost decrease (unless that eigenvalue is nonnegative, in which case Z = 0 
and Yj is returned with Y i Y i T being KKT). 

For all sufficiently smooth /, taking pk = n guarantees Algorithm 1 returns Y such 
that YY t is a KKT point. For convex /, KKT points may have arbitrary rank, so that 
allowing large pk seems necessary in general. For strongly concave /, it is sufficient to take 
Pk = |_P*J +1 (Corollary 3.12) ; for concave (and linear) /, it is sufficient to take pk = |_ 2 + 3 n J +1 
(Corollary 3.16), and it is expected that pk = |r*J + 1 should be sufficient (Corollary 3.14 
and discussion). 

In the latter case, in the unlikely event that Algorithm 1 terminates with Y of size nxpk , 
Pk > It*.] + 1) full-rank and second-order critical such that X = YY T is not a KKT point 
of (P), it is possible to further optimize without increasing the rank. Indeed, since dim J~x > 0, 
Proposition 3.17 shows how to compute Y' such that X' = Y'(Y ') T is on the boundary of J~x- 
Since / is concave, f(X') < f(X) (Lemma 3.11). Y' is critical and rank-deficient. If Y' is 
second-order critical, X' is KKT. Otherwise, Theorem 3.9 shows how to escape with a strict 
cost decrease. Iterating this procedure as needed, the cost decreases strictly (no cycling), and 
the rank p never exceeds pk- We expect this procedure to terminate since (P) admits KKT 
points of rank at most [p* J , but we do not prove this. 

In practice, for the RiemannianOptimization procedure, we use the Riemannian trust- 
region method (RTR) [3], through the Manopt toolbox [17]. RTR is a descent method. It 
converges toward critical points regardless of the initial iterate (global convergence ). 9 Fur¬ 
thermore, the stable fixed points of RTR are local optimizers, thus making convergence to 
points which are not second-order critical unlikely (but not impossible). Should this happen, 
Theorem 3.9 shows how to escape. Admittedly, it is unclear how many times this might have 
to be repeated in the worst case. 

Ideally, one would modify the RTR algorithm itself to ensure global convergence to second- 
order critical points. To the best of our knowledge, algorithms with such properties have not 
yet been described in the Riemannian setting. Nevertheless, we are hopeful that this should 

9 If the local optimizers of g were isolated, we could also guarantee local convergence at a quadratic rate, 
but g(Y) = g(YQ) for all orthogonal Q, so this is never the case. In practice though, we do observe a 
characteristically superlinear convergence. 
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be possible, in the light of recent work by Cartis et al. [23]. These authors indeed describe 
a modification of the classical trust-region method and guarantee polynomial-time conver¬ 
gence to approximate second-order critical points. Encouragingly, Sun et al. [70] achieved a 
strong result in this vein for dictionary learning with RTR on a sphere, hinting to a possible 
generalization on manifolds. 

RTR terminates once the norm of grad g drops below a certain threshold. Thus, the 
returned Y is not exactly a critical point, and as a result it is not, in general, exactly rank 
deficient even when it should be. Numerically, we declare rank-deficiency when the condition 
number of Y T Y exceeds some large threshold (say, 10 10 ). 

If a solution of rank q is sought but the obtained solution Y p £ St (d,p) m has rank p > q, 
one heuristic is to project Y p to St(d, q) m with any reasonable algorithm (call it Y q )—for 
example, compute the thin SVD of Y p = U"EV T , retain only the first q columns of C/E and 
orthonornralize each d x q slice. Then, run RlEMANNIANOPTIMlZATION(St(d, q ) m , g, Y q ). This 
typically returns a local optimizer of the hard problem. Experience shows the detour via the 
higher dimensional relaxation may help avoid bad local traps. 

5 Special case: linear cost function 

In the important special case where / is a linear function f(X) = (C , X) for some data matrix 
C £ § nxn , the convex problem (P) is an SDP. As per Remark 2.2, it is equivalent to (RP p ) 
as soon as p > p*. Remarkably, for any X € C, we obtain a lower-bound on the optimal value 
of the SDP, following an idea from Burer and Monteiro [20, §6.1]. 

Proposition 5.1 (bounds on the SDP value). Let f(X) = (C,X) be linear and let f* denote 
the optimal value of (P). Then, for all X £ C, 

f(X) + n-X min (S(X )) < /* < f(X). 

Proof. The dual of (P) is the following SDP: 

max trac e(C — S ), s.t. C — S is symmetric, block-diagonal, and 5^0. 

The matrix S = S — A m i n (S) -/ n for S = S(X) (21) is admissible. The result follows by strong 
duality, owing to Slater’s condition. □ 

Algorithm 1 solves the SDP by optimizing g in (RP p ), whose differentials are: 
g(Y) = trace(T T Cy), Vt/(T) = 2 CY, V 2 g(Y)[Y} = 2 CY. 

As an illustrative example, we here apply Algorithm 1 and competing SDP solvers to 
random instances of the orthogonal synchronization problem [9]. In this setting, one wishes 
to estimate m orthogonal matrices Qi,, Qm, based on noisy measurements of the relative 
transformations QiQj. See the introduction for applications. 

In this benchmark, for increasing values of m, target matrices of size d = 3 are generated 
uniformly at random. The measurements of relative rotations are Hij = QiQj + crNij (i < j), 
where u = 0.3 is the noise level and the N l3 's are independent random noise matrices with 
i.i.d. normal entries. We also set Hji = Hj- and Ha = Id- To estimate the Qf s from the Hij' s, 
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Figure 2: All methods solve (P) on problems from Section 5. The proposed staircase algorithm 
is the only one to return a solution which satisfies the constraints up to machine precision. 
It also returns the solution Y which is numerically closest to be of rank d. Its computational 
cost seems to grow at the same rate as that of merely computing d dominant eigenvectors of 
the data matrix (EIG), thus outperforming interior point methods as well as SDPLR. 


we set C = —H/(nm) and solve (P). If the solution has rank d, this is equivalent to solving 
the maximum likelihood problem: 


min 

Ql,~,Qm£0(d) 


^2 ll^t? QiQjWp- 


Remarkably, for all instances generated, (P) admits a rank d solution, thus revealing the 
true maximum likelihood estimator: a hard quantity to compute, in general. This serendipi¬ 
tous phenomenon is partly explained in [10]. 

Figure 2 shows how much time it takes various solvers to find this solution of rank d (they 
all do). Algorithm 1 runs RTR once on (RP p ) with p = d + 1, with a random initial guess, 
and returns with an optimal rank d solution. We compare against interior point methods 
SeDuMi [69], SDPT3 [74] and Mosek [48] (the latter two via CVX [39]) as well as against 
SDPLR [19] with and without forcing the search rank to d + 1 (the forced version is labeled 
SDPLR*). We also depict how much time it takes to simply compute the top d eigenvectors 
of H , which, after projection, reveal an (empirically) equally good estimator for this problem, 
but with weaker guarantees [65, 9]. 


6 Special case: Pseudo-Huber loss cost function 

In the previous section, orthogonal synchronization is considered with Gaussian noise on the 
relative measurements. Maximum likelihood estimation then naturally leads to the minimiza¬ 
tion of a quadratic cost in Y. which simplifies to a linear cost in X. 
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When the relative measurements H^ include outliers, least-squares are not expected to 
perform well. As an alternative, Wang and Singer [78] minimize a sum of unsquared errors, 
that is, they estimate the orthogonal matrices Qi as the minimizers of V- • || H-- — QaQJ\\f- 
A convex relaxation akin to the one from the previous section leads to solving (P) with the 
least unsquared deviations cost f(X ) = Yli j II Hij ~ -Xy|| F (LUD). This is similar in spirit to 
the convex relaxation for robust subspace estimation presented in [43]. The authors show 
that rounding the solutions of this convex program yields a good estimator, even if up to a 
(random) half of the data is random. In that regime, if the non-outliers are noiseless, solving 
the convex program achieves perfect recovery with high probability. They solve the problem 
with an alternating direction augmented Lagrangian method (ADM). 

Tools in this paper do not directly apply to the LUD cost, because it is nonsmooth. 10 
Unfortunately, in our experiments we also found that smoothing the LUD cost typically 
leads to higher rank solutions, at a significant computational premium. We formalize this 
observation in the following theorem. The assumptions on H are not restrictive: they require 
just the slightest inconsistency in the measurements. 

Theorem 6.1 (smoothing the LUD cost suppresses rank d solutions). Let £: M + — > M + 
be an increasing function (with l'(x) > 0 if x > 0) such that f: § nxn —y M defined by 
f{X) = J2ij t(\\Xij — || F ) is twice continuously differentiable, where each H- = Hj ver¬ 

ifies 11 Hij 11 0 p < 1 (which includes orthogonal matrices), and Ha = IIf H is not a rank-d 
matrix in C, then all KKT points of (P) have rank strictly larger than d. (Otherwise, X = H 
is the unique KKT point.) 

Proof. The gradient of / with respect to Xij is given by V f(X)ij = Wij(Xjj — Hij), with 
Wij = Wji = £'(\\Xij — Hij \| F )/ ||Xij — H ^|| F > 0 if Xij f H y, and Wij = 0 otherwise. This 
is well defined by assumption. For contradiction, assume A is a KKT point of (P) and 
rank(A) = d. By Theorem 3.3, S (21) is positive semidefinite. In particular, its diagonal 
blocks are positive semidefinite: 

Sii = Xf{X)a - sym ( ^ V/(X) ii l ii ) 

— y w l s y m (j^ijXij — x^) 

= Hj^ sym (HijXjj- /,/) >- 0. 

The last equality follows from the fact that, since rank(A') = d, each Xij is orthogonal 
(Proposition 2.1). Since ||iLjj|| 0 p < 1, each term sym {H-Xj-— If) is negative semidefinite. 
Thus, the Sf s are zero (simultaneously positive and negative semidehnite), showing that 
S' = 0 (by Schur’s complement). Hence, the off-diagonal blocks are zero too: Sij = V f(X)ij = 
(Xjj — H^) = 0, implying X = H: a contradiction. □ 

Remark 6.2. The latter theorem applies in particular for £(x) = x 2 . Thus, the convex 
problem (P) with f(X) = \\X — H || F is not expected to admit rank d solutions in the presence 
of even the smallest noise. This is in sharp contrast with the linear case, f(X) = —trac e(HX), 
even though these two costs differ only by a constant over the rank-d feasible X’s. The key 
difference is that the linear cost is also concave, pointing to concavity and nonsmoothness to 
promote rank-d solutions. 

10 Recent work on nonsmooth optimization on manifolds [41] may prove useful in this regard. 
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In view of these results, we take interest in minimizing the related smoothed cost: 
g(Y) = f{YY T ) = J2^{\\H ij Y j -Y i \\ F ), £ e ( x ) = \A; 2 + e 2 - £ ->■ |x| as £ ->■ 0. (29) 

Although it bears much resemblance with the convex LUD cost (they coincide when rank(X) = 
d and e = 0), this / is strongly concave in X. Indeed, \\HjjYj — Yi\\p is affine in Xij, so that 
/ is a sum of square roots of affine functions of X, and the terms under the square roots 

are larger than e 2 > 0 and smaller than e 2 + (\\H%j ||f + Vc?J ■ The following marks the 
dependence in X more explicitly: 


f(X) = \IWHiiWl + ||/d||| - 2 - e, 

-1 


hi 




s /\\H ij \\l + \\I d \\l-2(H ij ,X ij )+£‘ 


-II, 


v 


The only difference with a smoothed LUD cost is the term ||/d||p which appears instead of 
| X{j ||p. Considering that the aim is for the Xij 's to be orthogonal, which maximizes their 
norm by Proposition 2.3, refraining from minimizing ||Xjj||p appears as a good start. 

We may still compute a KKT point for (P), but there is no guarantee that such a point 
will be even a local minimizer anymore. On the bright side, Corollary 3.12 states that, by 
strong concavity of /, all KKT points of (P) are extreme points (thus they have rank at 
most [p*J (16)) and, for p > p* , all second-order critical points of (RP p ) reveal KKT points 
of (P). The numerical experiment below shows that, empirically, even for e > 0, the proposed 
algorithm typically converges to a rank-d KKT point of excellent quality. Furthermore, as e 
is decreased, the quality of the found KKT point increases (with warm-starting). 

We now use the proposed robust formulation of orthogonal synchronization to situations 
where the sought matrices are in fact permutations 11 (without modifying the algorithms). 
Synchronization of permutations notably arises in image association problems in computer 
vision [38, 52]. 

Let the Qi's be permutations to estimate and let the H tJ 's be measurements of the relative 
permutations QiQj- A subset of the measurements of a given size is selected uniformly at 
random and replaced by uniformly random permutations (outliers). The other measurements 
are correct. If perfect recovery of the Qi's is achieved, then the permutations are recovered. 

Figure 3 exhibits the perfect recovery phenomenon hinted by Wang and Singer [78]. We 
say “hinted” as the chosen scenario does not exactly fit the assumptions of these authors. Even 
in the face of many outliers, the true permutations are recovered, showing the applicability 
of the proposed methods to permutation estimation. 

In practice, we minimize / for some starting value e = 1, then re-solve for decreasing values 
down to e = 10~ 3 , warm-starting each new solve with the previous solution. The staircase 
method starts with a search rank p = d+1. For up to 80% outliers, RTR converges to a rank-d, 
second-order critical point of g ((Jd+ i(K) ~ 10~ 10 , ||gradg(y)|| < 10~ 6 and A m i n (Hessg(y)) > 
—10 10 , after scaling ) without the need to increase p , thus rapidly identifying a KKT point 
of (P) which appears to be a global optimizer. We compare with ADM [78] optimizing the 
LUD cost, but not with the IPM’s, as they rapidly run out of memory. 


11 Permutation matrices are binary matrices with exactly one 1 on each row and column. They are orthogonal. 
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Figure 3: Synchronization of m = 100 permutations of size d = 6. As explained in Section 6, 
for each pair of permutations we are given a relative permutation measurement. Some fraction 
of those are exact: this varies on the horizontal axis. The other measurements (selected 
uniformly at random) are uniformly random. When the mean squared error (vertical axis) 
is close to zero (say, below 10 6 ), the estimation is essentially perfect and we get back the 
true permutations. Remarkably, the staircase method with the pseudo-Huber loss cost (29) 
can accommodate up to 80% of outliers and still (empirically) achieve perfect recovery. It is 
faster and appears more resilient than ADM [78], but unfortunately, without access to the 
ground truth, we cannot claim we found a global optimum because (29) is concave. ADM, on 
the other hand, comes with guarantees as it solves a convex problem. 
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7 Conclusions and perspectives 


We proposed a novel algorithm to compute KKT points for optimization problems over a 
class of spectrahedra that come up in relaxations of various problems involving orthonormal 
matrices. Our approach consists in exploiting the smooth geometry of bounded-rank subsets 
of those spectrahedra, to reduce the problem to Riemannian optimization. This effectively 
allows one to control how much lifting (dimension increase) is involved in the relaxation. An 
investigation of both the convex and the Riemannian geometries of the total and the bounded- 
rank problem showed that, under certain conditions, it is only necessary to compute second- 
order critical points on a low-dimensional portion of the boundary of the spectrahedron. 
Numerical experiments confirm the usefulness of this observation. 

The present work triggers a number of questions for future investigation. 

• Which spectrahedra are such that their elements of bounded rank form a smooth mani¬ 
fold? (Journee et al. [40] cover a number of such sets.) When the search space is of such 
form with additional constraints, can those be accommodated efficiently? This would 
be useful to address the SDP’s in, e.g., [38, 60, 25, 11]. 

• What is the computational complexity of obtaining a second-order critical point of a 
sufficiently smooth function on a Riemannian manifold, up to a given accuracy? This 
might be answered by following work in [23, 70]. When Y is only approximately second- 
order critical and rank deficient, is YY T approximately KKT, in a certain sense? For 
linear /, Proposition 5.1 offers a positive answer. 

• For nonconvex /, the set of KKT points includes the local optimizers of (P), as well 
as a number of uninteresting points. All KKT points give rise to critical points, but 
not necessarily second-order critical points. Can this be used to improve guarantees? 
Could we compute second-order KKT points instead, thus possibly excluding even more 
spurious points? A starting point might be [57, Thm. 3.45] and [63]. 

• Assuming linear /, if (P) admits a unique solution of rank r (see, e.g., [10]), is it sufficient 
to explore (RP p ) with p = r + 1? Under the noise model of Section 5, this is observed 
empirically. Perhaps, this could be investigated via the expected size of the attraction 
basin of the global optimizers, similarly to [70] in the context of dictionary learning. 

• Finally, regarding Corollary 3.14 and its attached question: for a random linear cost 
function and p > p* , what is the probability that (RP p ) admits second-order critical 
points which are not global optimizers? 
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A Tightness for synchronization on a cycle 


Consider synchronization on a cycle: the goal is to estimate orthogonal matrices R m £ 

0(d) based on one cycle of measurements: ~ RiR 2 , H 2 ,3 ~ ..., H rn \ ~ 

This is achieved by solving (RP p ) with p = d and 

m m 

g(Y) = £ ITAm- %+illl = -%+i> + constant, 

Z=1 Z=1 


with the indexing convention that Y m+ \ = Y\ and H m , m+ 1 = id m ,i. Under a permissive 
condition on the measurements, Sharp et al. [64], Peters et al. [54] exhibit an explicit formula 
for the solution (they restrict their attention to rotation matrices, that is, orthogonal matri¬ 
ces of determinant +1). We show that under that same condition, the corresponding SDP 
relaxation (P) with 


f(X) = {C,X), 


c 


( 0 


Hj ,2 


\Hm, 1 


H 1,2 
0 


H 2 ,3 


ttT 


Hi, 1 


Hm—l,m 

0 


(30) 


is tight: there exists a unique solution of rank d which reveals the global optimum. 

The proof rests on two key ingredients: (a) we have an explicit formula for the solution X 
to certify, and (b) we have an explicit formula for the dual certificate S(X) to check. It seems 
reasonable to expect that the result should carry over to connected graphs whose cycles have 
disjoint edges. 

Recently, Zhang and Singer [83] gave a powerful tightness result for such SDP’s: as stated, 
their result is restricted to cycles of length 3, but it nicely accommodates non-orthogonal 
blocks in the data matrix C. 


Theorem A.l. Let Hi, 2 , TTo ,3 ,..., dd m ,.i £ 0(d) represent orthogonal measurements on a 
cycle (m > 3 ) and define their product P = Hi ,2 ■ -# 2,3 ■ ■ ■ id m ,i £ 0(d). If —1 is not an 
eigenvalue of P, then the semidefinite program (P) with cost (30) admits a unique solution of 
rank d. 


Proof. Part 1: guessing X. When the measurements are perfectly consistent, P = Id, and it 
is easy to construct X: set Y m = I,i and Y % = Hj^ + \Y l+ \ for i = 1... m — 1; then X tJ = YfYj. 
This construction does not use H m ,\ but still achieves g{Y) = 0 owing to P = Id, hence X 
is optimal. When the cycle is inconsistent, it is reasonable to guess that the least-squares 
criterion will attempt to spread the inconsistency evenly over each edge [64, 54]. One mth 
of the error is represented by an offset P l / m —taking this principal matrix root requires P 
not to have negative eigenvalues. We build Y by incorporating part of the error at each step, 
appropriately aligned. First define this recurrence: Q m = H m , 1 and Qi = Q%+\ for 

i = (m — 1)... 1 (note that Q\ = P). Then, Y m = Id and Y) = QiP~ 1 ^ in QjHi,i + iYi + i for 
i = (m — 1)... 1. As previously, X Vj = Y.{Yj. It is not hard to check that X t j = Q i P^ l ~^/ m Qj. 
Of course, X is admissible for (P) with rank d. 
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Part 2: certifying X. By Theorem 3.4, it is sufficient to verify that S(X) (21) is positive 
semidefinite with rank (m — 1 )d. Let U be a d x d unitary matrix such that D = U*PU is 
diagonal —U always exists since P is normal—and let V = diag(<3it/,..., Q m U) be a block- 
diagonal unitary matrix. We use V to operate a change of variables on C and X: 



(0 

Id 

D~ l \ 




Id 

o Id 




v*cv = - 


Id 

h 

, (V*XV)ij = 

(31) 


u 


Id 0 / 




We used that D is unitary. Indeed, since P is orthogonal without 1 as an eigenvalue, its 
eigenvalues are such that D = diag(e* 01 ,..., e l<d ) for 9 \,..., 9d £] — vr, vr[. The spectrum of 
S(X) is identical to that of V*S(X)V , thus we study: 


V*S(X) V 


/D 1 /™ + T> -1 / m -I d 

-I d D 1 /™ + D- 1 /™ -I d 


-Id 


\ ~D 


-D - 1 \ 


-Id 

-Id D 1 / m + D- 1 / m J 


(32) 


All blocks of (32) are diagonal, so that its rows and columns may be permuted (without 
affecting its spectrum) to make it block diagonal, with kth block of size m given by 


A = 


^2 cos (9k/m) —1 

— 1 2 cos (9k/m) 

-1 


-1 


—e 


iO h 


—e l6k \ 


-1 

-1 2 cos (Qk/m)J 


T 

u* 


(33) 


with T € u £ C m_1 and c = 2 cos (9k/m). It remains to show that A is positive 

semidefinite with rank m — 1 for any 9k £ ] — n, 7r[. Fortunately, T is tridiagonal and Toeplitz, 
so that its whole spectrum is known explicitly [51]: A j(T) = 2(cos (Qk/m) — cos(jir/m)) , for 
j = 1... m — 1. These eigenvalues are all positive. By the Cauchy interlacing theorem, 


Ai(A) < Ai(T) < A 2 (A) < A 2 (T) < ■ ■ ■ < A m _i(T) < A m (A). (34) 

In particular, A 2 (A),..., A m (A) > 0. Since the vector [e*( 1 / m ) e,fe , e d 2 / m ) 0 fe ; ..., e 1 ( m W fl t]* is in 
the kernel of A, it must be that Ai(A) = 0. This concludes the proof. tj 

In general, the condition on the eigenvalues of P is necessary. Indeed, for d = 1 and 
m = 3, choose the measurements such that P = — 1 (for example, +1, +1 and — 1) and verify 
that none of the 4 admissible rank-1 matrices are optimal. For the frequent case where the 
measurements are rotation matrices (that is, orthogonal with determinant +1), the condition 
is not too restrictive: even if they were distributed uniformly at random, P would satisfy the 
condition almost surely. 
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B Generic face dimension 


For d=l, the following theorem shows almost all faces of C have minimal dimension, as 
per the bound (15) [53, 12]. Key parts of the proof are due to Xiuyuan Cheng and Balazs 
Gerencser. 

Theorem B.l (Generically, faces have minimal dimension). For d = 1, if Y G St (d,p) m 
is selected uniformly at random, then, almost surely, dim F Y y t = max(0,A), with A = 

Pis+H _ m ^+ii ( 13 ). 

We first provide a useful lemma. 

Lemma B.2. Let v\,... ,v n be statistically independent random vectors in a vector space V 
of dimension k. If for all i and for all subspaces U C V with dimC/ < k, Pr[uj 6 [/] = 0, 
then, almost surely, dimspanjui,... ,v n } = min(n, k) (which is maximal). 

Proof. The proof is by recurrence. Define Ut = spanjui,..., vt} for t G {0,..., n}. Clearly, 
dim Co = 0. Assume dim Ut = min (t,k) almost surely (a.s.). If t > k, then dim[/f_|_i = 
dim 17* = k (a.s.). Otherwise, since Ut is statistically independent from vt+i, by assumption, 
dimCt+i = t + 1 (a.s.). Thus, for all t, dimUt = min (t,k) (a.s.). CJ 

Proof. Proof of Theorem B.l. Let y\,..., y n G MP denote the columns of Y T . A matrix 
X = YAY r with A G S pxp is parallel to the face Fx if (TAK t , e^e/) = (A, y t yj) = 0 for 
all i. We study the dimension s of the space spanned by the constraint matrices At = y t yj, 
since dim Fx = plyP ^ 1 ^ — s. We do so with Y taken uniformly at random. Thus, the yf s are 
sampled independently from A/"(0,/ p ), then scaled to unit norm. The dimension s does not 
depend on the scaling of the vectors yt, so we may safely ignore it. We note in passing that 
the proof holds for more general distributions too. 

We aim to apply Lemma B.2 with V = §p x p, k = and Vi = A*. The vfs are 

i.i.d., hence we omit the subscripts. To verify the lemma’s condition, let U be any proper 
subspace of § pxp : there exists a symmetric matrix I / 0 in the orthogonal complement of 
U. It suffices to check that (X,yy T ) = y T Xy / 0 (a.s.), with y ~ Af(0,I p ). Diagonalize 
X = QDQ t with Q G O (p). Notice that Q T y is distributed identically to y. Thus, defining 
D = diag(Ai,..., X p ) and y = (y 1 , ..., y p ) T , it suffices to check that J2j=i / 0 (a.s.). 

This is indeed true, since the (y- 7 ) 2 are independent: their (nontrivial) linear combination has 
a density which is a convolution of (scaled) xf densities: this has no point mass at zero. □ 

We expect that this result remains valid for d > 1, but we are missing a proof. 
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